City of San Francisco Trees

Imagine you've been commissioned by the City of San Francisco to tackle a problem they've been having with local flora. The parks department has kept extensive records of the city's trees since the 1970s - what species are growing, where they are, who maintains them - amassing a dataset of over 200K trees in that time.

The funding for that project has recently been called into question, and the City Board needs to see its value before reapproving funds for the following year. Stakeholders have raised several concerns over the past few years, and your job is to use the data to answer them. Good luck!

Jupyter Notebook

First things first, let's get some terminology straight.

  • The language we're working in – Python 3.7
  • The editor we're using – Google Colab, which runs the code on Google's servers and shows the results in our browser
  • The file format – an interactive Python notebook (a .ipynb file), also known as a Jupyter notebook

Jupyter notebooks have a few special properties that make them ideal for working with data:

  • A notebook is organized into cells, which can hold either code or markdown
  • We can run the cells in any order. Try it out!
  • The value of the last expression in a cell is displayed automatically, so there's no need to wrap it in print()
In [1]:
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
In [2]:
print(x) # Run this cell after running the one above, and again after running the one below
Answer to the Ultimate Question of Life, the Universe, and Everything
In [3]:
x = 42

Anything you can do in Python, you can do here!

  1. Write a function that takes a string as input, and does something to it
  2. In a new cell, call the function and test it out
In [4]:
def UltimateQuestion(computer_name):
    return computer_name + ' is thinking...'
In [5]:
UltimateQuestion('Deep Thought')
Out[5]:
'Deep Thought is thinking...'

Importing packages

We use the pandas package to easily work with data as tables.
The numpy package lets us work with some other special data types and values, like np.nan for missing values.

We'll import these under the short names pd and np, just so it's easier to refer to them later on.

In [33]:
import pandas as pd
import numpy as np
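
To make the point about missing values concrete, here is a minimal sketch on made-up numbers (the diameters below are illustrative, not taken from the tree data). It shows how np.nan marks a missing entry and how pandas treats it:

```python
import numpy as np
import pandas as pd

# Toy diameters; np.nan marks a measurement that was never recorded
diameters = pd.Series([16.0, 2.0, np.nan, 4.0])

# isna() flags the missing entries with True
missing_mask = diameters.isna()
n_missing = int(missing_mask.sum())

# Most pandas reductions, such as mean(), skip NaN by default
mean_diameter = diameters.mean()

print(n_missing)
print(mean_diameter)
```

Because mean() skips the missing entry, the average here is taken over three values rather than four.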

Importing data

For this semester, we'll typically work with data in tabular format, the type you'd be used to in an Excel spreadsheet. Data files saved in this format will usually have a .csv file ending, short for comma-separated values.

For example, a CSV file could look something like...

tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St

To import this, let's use the pd.read_csv() function:

In [7]:
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/workshop/trees.csv'
trees = pd.read_csv(url)

Here, we've saved the data to a dataframe object named trees

In [8]:
type(trees)
Out[8]:
pandas.core.frame.DataFrame

DataFrames hold our data in spreadsheet-like tables. Whatever manipulation you can think of doing to the data, you can likely find out how to do it with a quick search.
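
If you'd like to see pd.read_csv() at work without downloading anything, the sketch below parses the three example rows from earlier straight out of a string. The variable names csv_text and sample are just placeholders for this illustration, and skipinitialspace=True is used because that example puts a space after each comma:

```python
import io
import pandas as pd

# The small example CSV from above, held in a string instead of a file
csv_text = """tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St
"""

# io.StringIO lets read_csv treat the string like an open file;
# skipinitialspace=True drops the space that follows each comma
sample = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(sample.columns.tolist())
print(sample.shape)
```

The first line of the text becomes the column names, and every later line becomes one row of the dataframe.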

Exploring dataframes

Let's take a look at the data. We'll use the .head() method to display the first 5 rows

In [9]:
trees.head()
Out[9]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
0 30314 DPW Maintained Private 16.0 NaN Pittosporum undulatum Victorian Box 1955-10-20 Sidewalk Cutout 37.759772 -122.398109 501 Arkansas St
1 30321 DPW Maintained Private 2.0 NaN Magnolia grandiflora Southern Magnolia 1956-01-06 Sidewalk Cutout 37.795718 -122.441860 2828 Divisadero St
2 30334 DPW Maintained Private 4.0 NaN Ginkgo biloba Maidenhair Tree 1956-02-06 Sidewalk Cutout 37.743222 -122.433634 601 29th St
3 30335 DPW Maintained Private 2.0 NaN Ginkgo biloba Maidenhair Tree 1956-02-06 Sidewalk Cutout 37.743226 -122.433565 601 29th St
4 30333 DPW Maintained Private 1.0 NaN Arbutus 'Marina' Hybrid Strawberry Tree 1956-02-06 Sidewalk Cutout 37.743217 -122.433721 601 29th St

How big is the dataset? .shape returns a tuple with the dimensions as (rows, columns)

In [10]:
trees.shape
Out[10]:
(36073, 13)

Let's try to understand our data a bit better.

  • How many different tree species are in the dataset?
In [11]:
trees.species_name.nunique()
Out[11]:
367
  • Which tree shows up the most frequently?
In [12]:
trees.common_name.value_counts()
Out[12]:
Swamp Myrtle              2781
Brisbane Box              2751
Hybrid Strawberry Tree    1968
Victorian Box             1604
Southern Magnolia         1602
                          ... 
Flooded Box: Coolibah        1
Fuji Apple Tree 'Fuji'       1
Cabbage tree                 1
Cabada palm                  1
Little-Leaf Azara            1
Name: common_name, Length: 365, dtype: int64
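
The full value_counts() table answers the question, but sometimes we only want the winner itself. A small sketch on a toy dataframe (the rows are made up for illustration) shows a couple of ways to pull that out:

```python
import pandas as pd

# Toy stand-in for the trees dataframe
toy = pd.DataFrame({
    'common_name': ['Swamp Myrtle', 'Brisbane Box', 'Swamp Myrtle',
                    'Victorian Box', 'Swamp Myrtle', 'Brisbane Box'],
})

# value_counts() sorts from most to least frequent
counts = toy['common_name'].value_counts()

# idxmax() returns the label with the highest count
most_common = counts.idxmax()

# nunique() counts the distinct names
n_unique = toy['common_name'].nunique()

print(most_common, n_unique)
```

Note that toy['common_name'] and toy.common_name refer to the same column; the bracket form also works when a column name contains spaces.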

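Show the biggest trees by sorting the dataframe:
Note: dbh is the standard forestry abbreviation for diameter at breast height, so it records trunk diameter rather than the diameter at the very base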

In [13]:
trees.sort_values(by='dbh', ascending=False)
Out[13]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
34738 14513 DPW Maintained DPW 100.0 4X4 Fraxinus uhdei Shamel Ash: Evergreen Ash 2018-06-18 Sidewalk Cutout 37.776560 -122.446728 501 Masonic Ave
28183 12738 DPW Maintained DPW 100.0 4x4 Tristaniopsis laurina 'Elegant' Small-leaf Tristania 'Elegant' 2013-07-12 Sidewalk Cutout 37.786183 -122.477196 1630 Lake St
5025 4768 DPW Maintained DPW 100.0 3X3 Corymbia ficifolia Red Flowering Gum 1993-01-05 Sidewalk Cutout 37.732715 -122.385231 26 Commer Ct
17964 24961 DPW Maintained DPW 90.0 20 Phoenix canariensis Canary Island Date Palm 2005-04-21 Median Cutout 37.767709 -122.426675 100 Dolores St
5581 13104 DPW Maintained DPW 90.0 3X3 Ficus retusa nitida Banyan Fig 1993-10-26 Sidewalk Cutout 37.801143 -122.426724 1530 Lombard St
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14101 78518 DPW Maintained Private 0.0 NaN Prunus cerasifera Cherry Plum 2000-10-21 Sidewalk Cutout 37.710295 -122.450931 75 Laura St
14114 78567 DPW Maintained Private 0.0 NaN Arbutus 'Marina' Hybrid Strawberry Tree 2000-10-21 Sidewalk Cutout 37.710306 -122.453138 40 Sears St
14763 44728 DPW Maintained Private 0.0 NaN Melaleuca quinquenervia Cajeput 2001-04-03 Sidewalk Cutout 37.748648 -122.477643 1144 Quintara St
14796 44797 DPW Maintained Private 0.0 NaN Prunus serrulata Ornamental Cherry 2001-04-12 Sidewalk Cutout 37.765145 -122.480368 1206 22nd Ave
36072 144192 DPW Maintained Private 0.0 Width 4ft Lophostemon confertus Brisbane Box 2020-01-25 Sidewalk Cutout 37.776940 -122.502697 618 42nd Ave

36073 rows × 13 columns
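
Sorting the whole dataframe shows everything, but if we only want the top few rows there are two common routes. This is a sketch on a toy dataframe with made-up values, comparing sort_values() plus head() against nlargest():

```python
import pandas as pd

# Toy stand-in for the trees dataframe
toy = pd.DataFrame({
    'common_name': ['Banyan Fig', 'Cherry Plum', 'Red Flowering Gum', 'Cajeput'],
    'dbh': [90.0, 0.0, 100.0, 12.0],
})

# Option 1: sort everything, then keep the first two rows
by_sort = toy.sort_values(by='dbh', ascending=False).head(2)

# Option 2: ask directly for the two largest values of dbh
by_nlargest = toy.nlargest(2, 'dbh')

print(by_nlargest)
```

Both give the same two rows here; nlargest() simply saves a step when only the top of the table matters.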

Subsetting

Subsetting is a super helpful tool. We'll look at it in more depth next week, but for now, here are the basics:

We can filter rows from a dataframe based on some condition

  • Show only Cherry Plum trees
In [14]:
trees[trees.common_name == 'Cherry Plum']
Out[14]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
149 53700 Permitted Site Private 14.0 NaN Prunus cerasifera Cherry Plum 1970-03-04 Sidewalk Cutout 37.746081 -122.426025 263 Duncan St
198 54020 DPW Maintained Private 13.0 NaN Prunus cerasifera Cherry Plum 1972-04-07 Sidewalk Cutout 37.772780 -122.494875 862 35th Ave
208 54057 DPW Maintained Private 8.0 NaN Prunus cerasifera Cherry Plum 1972-04-21 Sidewalk Cutout 37.772551 -122.494860 874 35th Ave
265 54255 Permitted Site Private 10.0 3x3 Prunus cerasifera Cherry Plum 1972-07-03 Sidewalk Cutout 37.759509 -122.442802 191 Caselli Ave
364 221734 DPW Maintained Private 12.0 Width 4ft Prunus cerasifera Cherry Plum 1972-08-17 Sidewalk Cutout 37.765292 -122.452934 203 Carl St
... ... ... ... ... ... ... ... ... ... ... ... ... ...
35535 55973 DPW Maintained Private 3.0 NaN Prunus cerasifera Cherry Plum 2019-06-10 Sidewalk Cutout 37.791259 -122.432719 2221 Webster St
35571 236272 DPW Maintained Private 3.0 Width 3ft Prunus cerasifera Cherry Plum 2019-07-26 Sidewalk Cutout 37.766989 -122.416495 99 Shotwell St
35572 236271 DPW Maintained Private 3.0 Width 3ft Prunus cerasifera Cherry Plum 2019-07-26 Sidewalk Cutout 37.767032 -122.416501 99 Shotwell St
35700 246210 DPW Maintained Private 3.0 Width 0ft Prunus cerasifera Cherry Plum 2019-10-01 Sidewalk Cutout 37.767967 -122.443800 725 Buena Vista Ave West
35701 246211 DPW Maintained Private 3.0 Width 0ft Prunus cerasifera Cherry Plum 2019-10-01 Sidewalk Cutout 37.767917 -122.443821 725 Buena Vista Ave West

1180 rows × 13 columns

How would you show only trees north of Golden Gate Park (latitude > 37.77285)?

Hint: Write the condition the same way you would in a Python if statement, mirroring the syntax above

In [15]:
trees[trees.latitude > 37.77285]
Out[15]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
1 30321 DPW Maintained Private 2.0 NaN Magnolia grandiflora Southern Magnolia 1956-01-06 Sidewalk Cutout 37.795718 -122.441860 2828 Divisadero St
5 30339 DPW Maintained Private 11.0 NaN Platanus x hispanica Sycamore: London Plane 1956-02-15 Sidewalk Cutout 37.793189 -122.441380 2560 Divisadero St
6 30337 DPW Maintained Private 12.0 NaN Platanus x hispanica Sycamore: London Plane 1956-02-15 Sidewalk Cutout 37.793242 -122.441395 2560 Divisadero St
7 30341 DPW Maintained Private 10.0 NaN Acacia melanoxylon Blackwood Acacia 1956-02-15 Sidewalk Cutout 37.805913 -122.437521 3789 Fillmore St
20 30418 DPW Maintained Private 12.0 NaN Platanus x hispanica Sycamore: London Plane 1956-03-26 Sidewalk Cutout 37.797295 -122.440879 2509 Filbert St
... ... ... ... ... ... ... ... ... ... ... ... ... ...
36068 144227 DPW Maintained Private 0.0 Width 4ft Agonis flexuosa Peppermint Willow 2020-01-25 Sidewalk Cutout 37.773933 -122.503557 782 43rd Ave
36069 144230 DPW Maintained Private 0.0 Width 4ft Melaleuca quinquenervia Cajeput 2020-01-25 Sidewalk Cutout 37.775598 -122.503676 696 43rd Ave
36070 261517 DPW Maintained Private 3.0 Width 3ft Agonis flexuosa Peppermint Willow 2020-01-25 Sidewalk Yard 37.775886 -122.501730 679 41st Ave
36071 144157 DPW Maintained Private 0.0 Width 4ft Tristaniopsis laurina Swamp Myrtle 2020-01-25 Sidewalk Cutout 37.774642 -122.501452 746 41st Ave
36072 144192 DPW Maintained Private 0.0 Width 4ft Lophostemon confertus Brisbane Box 2020-01-25 Sidewalk Cutout 37.776940 -122.502697 618 42nd Ave

15811 rows × 13 columns
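
The two filters above can also be combined. Here is a minimal sketch on a toy dataframe (the rows are illustrative), using & for "and" and | for "or". Storing each condition in its own variable first keeps the brackets readable:

```python
import pandas as pd

# Toy stand-in for the trees dataframe
toy = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Cherry Plum', 'Brisbane Box', 'Cherry Plum'],
    'latitude': [37.746081, 37.791259, 37.776940, 37.772780],
})

# Each condition is a column of True/False values, one per row
is_cherry_plum = toy['common_name'] == 'Cherry Plum'
is_north = toy['latitude'] > 37.77285

# & keeps rows where both conditions hold
north_cherry_plums = toy[is_cherry_plum & is_north]

# | keeps rows where at least one condition holds
cherry_or_north = toy[is_cherry_plum | is_north]

print(north_cherry_plums)
```

If the conditions are written inline instead, each one needs its own parentheses, as in toy[(toy.common_name == 'Cherry Plum') & (toy.latitude > 37.77285)]. Python's and / or keywords won't work on whole columns.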

Data Manipulation

What is the average diameter of the Evergreen Pear tree?

In [16]:
trees[trees.common_name == 'Evergreen Pear'].dbh.mean()
Out[16]:
5.306595365418895
In [29]:
# Pick the dbh column before averaging, so text columns are never passed to mean()
trees.groupby('common_name')['dbh'].mean().sort_values(ascending=False).head(20)
Out[29]:
common_name
Date palm (species unknown)          70.000000
False Avocado                        35.000000
Canary Island Date Palm              30.912664
Flooded Box: Coolibah                30.000000
Morton Bay Fig                       29.000000
Douglas Fir                          26.333333
Moraine Ash                          26.000000
Burgundy Sweet Gum                   24.000000
Yucca                                23.000000
Beefwood: Drooping She-Oak           22.666667
Bloodgood London Plane               21.750000
Norfolk Island Pine                  20.333333
Shamel Ash: Evergreen Ash            20.294118
Poplar Spp                           18.000000
Nichol's Willow-Leafed Peppermint    17.387097
Silver Mountain Gum Tree             17.000000
Silk Oak Tree 'Red Hooks'            16.200000
Siberian Elm                         16.105263
Lombardy Poplar                      16.000000
Blue Gum                             15.250000
Name: dbh, dtype: float64
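
Round averages like 70.0 or 35.0 at the top of a ranking like this hint that some names may only have a handful of trees behind them. As a sketch on a toy dataframe (made-up values, and the cutoff of 2 trees is an arbitrary choice for illustration), we can compute the count next to the mean and filter on it:

```python
import pandas as pd

# Toy stand-in for the trees dataframe
toy = pd.DataFrame({
    'common_name': ['Date palm', 'Evergreen Pear', 'Evergreen Pear',
                    'Evergreen Pear', 'Blue Gum', 'Blue Gum'],
    'dbh': [70.0, 4.0, 6.0, 5.0, 15.0, 16.0],
})

# agg() with a list computes several statistics per group at once
summary = toy.groupby('common_name')['dbh'].agg(['mean', 'count'])

# Keep only names backed by at least 2 trees, then rank by average diameter
reliable = summary[summary['count'] >= 2].sort_values('mean', ascending=False)

print(reliable)
```

The single 'Date palm' drops out of the ranking, since one measurement doesn't tell us much about a typical tree.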

Visualization

First things first, let's import the package to help us visualize the data, plotly.

If this package isn't yet installed, we can install it using !pip install plotly. More on this in week 5.

In [32]:
import plotly.express as px

Note that we're using a subpackage of the broader plotly package, called plotly express. It simplifies a lot of the more difficult steps.

Plotly express has a broad range of options to play with, so let's take a look at the documentation.
Do a quick Google search to pull up the documentation for px.scatter, or run px.scatter? in a Jupyter cell.

In [36]:
px.scatter?
Signature:
px.scatter(
    data_frame=None,
    x=None,
    y=None,
    color=None,
    symbol=None,
    size=None,
    hover_name=None,
    hover_data=None,
    custom_data=None,
    text=None,
    facet_row=None,
    facet_col=None,
    facet_col_wrap=0,
    error_x=None,
    error_x_minus=None,
    error_y=None,
    error_y_minus=None,
    animation_frame=None,
    animation_group=None,
    category_orders={},
    labels={},
    color_discrete_sequence=None,
    color_discrete_map={},
    color_continuous_scale=None,
    range_color=None,
    color_continuous_midpoint=None,
    symbol_sequence=None,
    symbol_map={},
    opacity=None,
    size_max=None,
    marginal_x=None,
    marginal_y=None,
    trendline=None,
    trendline_color_override=None,
    log_x=False,
    log_y=False,
    range_x=None,
    range_y=None,
    render_mode='auto',
    title=None,
    template=None,
    width=None,
    height=None,
)
Docstring:
    In a scatter plot, each row of `data_frame` is represented by a symbol
    mark in 2D space.
    
Parameters
----------
data_frame: DataFrame or array-like or dict
    This argument needs to be passed for column names (and not keyword
    names) to be used. Array-like and dict are tranformed internally to a
    pandas DataFrame. Optional: if missing, a DataFrame gets constructed
    under the hood using the other arguments.
x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the x axis in cartesian coordinates. For
    horizontal histograms, these values are used as inputs to `histfunc`.
y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the y axis in cartesian coordinates. For vertical
    histograms, these values are used as inputs to `histfunc`.
color: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign color to marks.
symbol: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign symbols to marks.
size: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign mark sizes.
hover_name: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in bold
    in the hover tooltip.
hover_data: list of str or int, or Series or array-like
    Either names of columns in `data_frame`, or pandas Series, or
    array_like objects Values from these columns appear as extra data in
    the hover tooltip.
custom_data: list of str or int, or Series or array-like
    Either names of columns in `data_frame`, or pandas Series, or
    array_like objects Values from these columns are extra data, to be used
    in widgets or Dash callbacks for example. This data is not user-visible
    but is included in events emitted by the figure (lasso selection etc.)
text: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in the
    figure as text labels.
facet_row: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the vertical direction.
facet_col: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the horizontal direction.
facet_col_wrap: int
    Maximum number of facet columns. Wraps the column variable at this
    width, so that the column facets span multiple rows. Ignored if 0, and
    forced to 0 if `facet_row` or a `marginal` is set.
error_x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars. If `error_x_minus` is `None`, error bars will
    be symmetrical, otherwise `error_x` is used for the positive direction
    only.
error_x_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars in the negative direction. Ignored if `error_x`
    is `None`.
error_y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars. If `error_y_minus` is `None`, error bars will
    be symmetrical, otherwise `error_y` is used for the positive direction
    only.
error_y_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars in the negative direction. Ignored if `error_y`
    is `None`.
animation_frame: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to animation frames.
animation_group: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    provide object-constancy across animation frames: rows with matching
    `animation_group`s will be treated as if they describe the same object
    in each frame.
category_orders: dict with str keys and list of str values (default `{}`)
    By default, in Python 3.6+, the order of categorical values in axes,
    legends and facets depends on the order in which these values are first
    encountered in `data_frame` (and no order is guaranteed by default in
    Python below 3.6). This parameter is used to force a specific ordering
    of values per column. The keys of this dict should correspond to column
    names, and the values should be lists of strings corresponding to the
    specific display order desired.
labels: dict with str keys and str values (default `{}`)
    By default, column names are used in the figure for axis titles, legend
    entries and hovers. This parameter allows this to be overridden. The
    keys of this dict should correspond to column names, and the values
    should correspond to the desired label to be displayed.
color_discrete_sequence: list of str
    Strings should define valid CSS-colors. When `color` is set and the
    values in the corresponding column are not numeric, values in that
    column are assigned colors by cycling through `color_discrete_sequence`
    in the order described in `category_orders`, unless the value of
    `color` is a key in `color_discrete_map`. Various useful color
    sequences are available in the `plotly.express.colors` submodules,
    specifically `plotly.express.colors.qualitative`.
color_discrete_map: dict with str keys and str values (default `{}`)
    String values should define valid CSS-colors Used to override
    `color_discrete_sequence` to assign a specific colors to marks
    corresponding with specific values. Keys in `color_discrete_map` should
    be values in the column denoted by `color`.
color_continuous_scale: list of str
    Strings should define valid CSS-colors This list is used to build a
    continuous color scale when the column denoted by `color` contains
    numeric data. Various useful color scales are available in the
    `plotly.express.colors` submodules, specifically
    `plotly.express.colors.sequential`, `plotly.express.colors.diverging`
    and `plotly.express.colors.cyclical`.
range_color: list of two numbers
    If provided, overrides auto-scaling on the continuous color scale.
color_continuous_midpoint: number (default `None`)
    If set, computes the bounds of the continuous color scale to have the
    desired midpoint. Setting this value is recommended when using
    `plotly.express.colors.diverging` color scales as the inputs to
    `color_continuous_scale`.
symbol_sequence: list of str
    Strings should define valid plotly.js symbols. When `symbol` is set,
    values in that column are assigned symbols by cycling through
    `symbol_sequence` in the order described in `category_orders`, unless
    the value of `symbol` is a key in `symbol_map`.
symbol_map: dict with str keys and str values (default `{}`)
    String values should define plotly.js symbols Used to override
    `symbol_sequence` to assign a specific symbols to marks corresponding
    with specific values. Keys in `symbol_map` should be values in the
    column denoted by `symbol`.
opacity: float
    Value between 0 and 1. Sets the opacity for markers.
size_max: int (default `20`)
    Set the maximum mark size when using `size`.
marginal_x: str
    One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
    horizontal subplot is drawn above the main plot, visualizing the
    x-distribution.
marginal_y: str
    One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
    vertical subplot is drawn to the right of the main plot, visualizing
    the y-distribution.
trendline: str
    One of `'ols'` or `'lowess'`. If `'ols'`, an Ordinary Least Squares
    regression line will be drawn for each discrete-color/symbol group. If
    `'lowess`', a Locally Weighted Scatterplot Smoothing line will be drawn
    for each discrete-color/symbol group.
trendline_color_override: str
    Valid CSS color. If provided, and if `trendline` is set, all trendlines
    will be drawn in this color.
log_x: boolean (default `False`)
    If `True`, the x-axis is log-scaled in cartesian coordinates.
log_y: boolean (default `False`)
    If `True`, the y-axis is log-scaled in cartesian coordinates.
range_x: list of two numbers
    If provided, overrides auto-scaling on the x-axis in cartesian
    coordinates.
range_y: list of two numbers
    If provided, overrides auto-scaling on the y-axis in cartesian
    coordinates.
render_mode: str
    One of `'auto'`, `'svg'` or `'webgl'`, default `'auto'` Controls the
    browser API used to draw marks. `'svg`' is appropriate for figures of
    less than 1000 data points, and will allow for fully-vectorized output.
    `'webgl'` is likely necessary for acceptable performance above 1000
    points but rasterizes part of the output.  `'auto'` uses heuristics to
    choose the mode.
title: str
    The figure title.
template: or dict or plotly.graph_objects.layout.Template instance
    The figure template name or definition.
width: int (default `None`)
    The figure width in pixels.
height: int (default `600`)
    The figure height in pixels.

Returns
-------
    A `Figure` object.
File:      /anaconda3/lib/python3.7/site-packages/plotly/express/_chart_types.py
Type:      function
In [21]:
trees_sample = trees.sample(frac=.2)
In [23]:
fig = px.scatter(trees_sample, x='date', y='dbh')
fig.show('notebook')

There aren't any obvious trends from this view. Let's add in some more parameters:

In [31]:
fig = px.scatter(trees_sample, x='date', y='dbh', 
                 opacity=.15, color='site_location', 
                 hover_name='common_name', hover_data=['site_location','site_type','address'],
                 marginal_x = 'histogram', marginal_y = 'histogram',
                 color_discrete_sequence = px.colors.qualitative.Prism[4:],
                 labels={'site_location':'Site Location', 'dbh':'Tree Diameter', 'date':'Date Recorded'}
                )
fig.show('notebook')
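
When a scatter plot of date against dbh is hard to read, a numeric summary can serve as a cross-check. The sketch below uses a toy dataframe with made-up rows and assumes the date column arrives as text in YYYY-MM-DD form, as it appears in the tables above. It converts the text to real dates and takes the median diameter per year:

```python
import pandas as pd

# Toy stand-in for the trees dataframe
toy = pd.DataFrame({
    'date': ['1955-10-20', '1955-11-02', '1993-01-05', '1993-10-26', '2018-06-18'],
    'dbh': [16.0, 2.0, 100.0, 90.0, 3.0],
})

# Convert the text into datetime values so we can pull out the year
toy['date'] = pd.to_datetime(toy['date'])

# Group by the year of each record and take the median diameter
median_by_year = toy.groupby(toy['date'].dt.year)['dbh'].median()

print(median_by_year)
```

The median is used here rather than the mean so that a few very large trees don't dominate a year's summary.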

Geographic Plots

The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.

  • Is there a general area in which there are more roadside / median trees?
  • Could you show the address, caretaker, and name of the tree on hover?
In [28]:
# Stamen tiles may now need a Stadia Maps API key; 'open-street-map' needs no token
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude', mapbox_style="open-street-map", zoom=11, 
                        color='site_location', size='dbh', opacity=.3,
                        color_discrete_sequence=['orange','red','orange','orange','orange','orange'],
                        hover_name='address',hover_data=['site_location','caretaker'],
                        labels={'site_location':'Site Location', 'dbh':'Tree Diameter', 
                                'date':'Date Recorded', 'caretaker':'Care Taker'}

                       )
fig.show('notebook')
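
To back up the map with numbers, one option is to count median trees per small grid cell. This is a rough sketch on a toy dataframe: it assumes median trees are labeled 'Median' in site_location, and the names lat_cell and lon_cell are made up for this illustration:

```python
import pandas as pd

# Toy stand-in for the trees dataframe
toy = pd.DataFrame({
    'site_location': ['Median', 'Sidewalk', 'Median', 'Median', 'Sidewalk'],
    'latitude': [37.767709, 37.759772, 37.767912, 37.732715, 37.795718],
    'longitude': [-122.426675, -122.398109, -122.426701, -122.385231, -122.441860],
})

# Keep only the trees on a road median
medians = toy[toy['site_location'] == 'Median']

# Rounding to 2 decimals groups trees into cells roughly 1 km across
cells = medians.assign(lat_cell=medians['latitude'].round(2),
                       lon_cell=medians['longitude'].round(2))

# Count trees per cell, busiest cell first
counts = cells.groupby(['lat_cell', 'lon_cell']).size().sort_values(ascending=False)

print(counts)
```

The cells at the top of the list are the areas with the most median trees, which is where debris crews might want to look first.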